BACKGROUND



1. Field 

The present invention relates to the field of data communications. In
particular, this invention relates to a system and method for enhancing the
reliability of voice activity detection.



2. General Background 

For many years, discontinuous transmission (DTX) systems have been
installed to conserve bandwidth over packet voice/data networks. Bandwidth
conservation is accomplished by detecting when a caller is speaking and
transmitting speech packets generated by a speech coder during those periods
of time. For the remaining periods of time when the caller is not speaking,
certain DTX systems have been configured to transmit a background noise
level tracked by a voice activity detector. This background noise level is
subsequently used to replicate the background noise during the silence gaps
between communications, which are a considerable portion of normal speech
communications.

Conventional DTX systems consist of a voice activity detector (VAD)
and a comfort noise generator (CNG). Normally, a "voice activity detector"
(VAD) is software processed by circuitry to digitize an analog signal (e.g., voice
and/or background noise) and to determine whether or not a particular
segment of the digitized analog signal represents a person's voice. Since the
range of a person's voice is dynamic, in some situations varying 20-40 decibels
(dB), and background noise can vary moment to moment, a number of
different parameters have been used by conventional VADs to discern voice
activity.



For example, an IEEE publication entitled "Application of an LPC
distance measure to the voiced-unvoiced-silence detection problem," authored
by L.R. Rabiner and M.R. Sambur, describes a voice activity detector (VAD)
performing a pattern recognition approach on incoming digitally sampled
signals to detect voice activity. In particular, this VAD creates templates of
parameters for voiced, unvoiced (e.g., tailing off sounds for certain words) and
silence segments of speech. Each template includes five parameters: the
energy of the signal (Eg); the zero-crossing rate of the signal (Nz); the
autocorrelation coefficient at unit sample delay (C1); the first order predictor
coefficient (A1); and the normalized prediction error (Ep). Through
probability calculations, decision logic compares the templates with a sampled
segment of an incoming signal to determine whether the segment represents
voiced speech, unvoiced speech or silence. The disadvantage associated with
this VAD is that it is extremely difficult to find a set of reliable templates to
distinguish between a variety of speech signals and numerous levels of
background noise found in different environments.

Another example of a VAD involves the use of linear prediction
coefficients (LPCs) which are calculated in the speech coder. While taking
advantage of the LPCs calculated in the speech coder reduces the computational
power consumed by the VAD, this approach has a number of
disadvantages. For example, speech coders in accordance with the
International Telegraph and Telephone Consultative Committee (CCITT)
G.729B standards perform linear predictive coding differently than speech
coders in accordance with CCITT G.723 standards. As a result, there does not
exist a VAD which can be used by virtually all types of speech coders. Instead,
depending on the type of speech coder implemented, the VAD must be
modified to operate in combination with that speech coder. This increases
overall ownership costs and the difficulty in upgrading the DTX system.



Over the last few years, MICOM Communications Corporation of Simi
Valley, California, has produced voice/data networking products for DTX
systems that utilize a universal energy-based VAD. The voice/data
networking products include a dual-mode speech coding function in order to
achieve bandwidth efficiency. In a VOICE mode, a selected speech coder is
responsible for compressing voice signals before transmission and for
decompressing the voice signals upon reception. In a SILENCE
SUPPRESSION mode, only the background noise level signal is transmitted,
from which white noise is regenerated at the destination.

Currently, two parameters are used by this universal VAD function in
order to determine whether the voice/data networking product is operating in
a VOICE mode or a SILENCE SUPPRESSION mode. These parameters include
(i) short-term tracking energy and (ii) long-term tracking energy. The
"short-term tracking energy" is an accumulation of signal energy associated with
voice signaling and background noise level, and thus, is represented by
equation (1).

(1) $E_{trk}(k) = \alpha \times E_{dB}(k) + (1 - \alpha) \times E_{trk}(k-1)$, where

$\alpha$ is a weighting factor selected between two values according to a
condition;

$E_{dB}(k)$ denotes the current frame energy in decibels and is
equivalent to $10\log_{10}\sum_{i=0}^{N-1} s^{2}(i)$, where "N"
represents the number of samples per frame; and

$E_{trk}(k-1)$ denotes the short-term tracking energy for the
previous frame.

The "long-term tracking energy" represents the background noise level
associated with incoming audio and is measured by equation (2).

(2) $E_{l}(k) = \min\{\beta E_{l}(k-1) + (1-\beta) E_{trk}(k),\ E_{max}\}$, where

$\beta = 0.875$; and

$E_{max}$ denotes the maximum background level.

As a result, when the calculated value of the long-term tracking energy
approaches the calculated value of the short-term tracking energy, the VAD
predicts that a segment of sampled signals associated with a current frame is
likely to be silence. One problem that has been encountered is that this
conventional VAD is subject to increased switching between VOICE mode and
SILENCE SUPPRESSION mode during long periods of silence, where the
long-term tracking energy naturally approaches the short-term tracking energy.
This increased switching, referred to as "in/out effects," causes audio volume
fluctuations detectable by the human ear.

Hence, it would be advantageous to provide a system and method for
enhancing reliability of voice activity detection through development of an
improved, universal VAD which relies on a peak-to-mean likelihood ratio.
The peak-to-mean likelihood ratio reduces the occurrence of the in/out
effects by further assisting the VAD, in certain instances, to determine
whether an incoming analog signal represents voice or silence.
SUMMARY OF THE INVENTION



The present invention relates to a voice activity detector, being either
software executable by a processing unit or firmware, which predicts whether
an audio frame represents a voice signal or silence. This prediction is based
on the analysis of a number of parameters, including a short-term averaged
energy (STAE), a long-term averaged energy (LTAE), and a peak-to-mean
likelihood ratio (PMLR).

In one embodiment, to predict whether a frame represents voice or
silence, an initial determination is made whether a sum of the STAE and a
factor is greater than the LTAE. If the sum is less than the LTAE, the audio
frame represents silence. Otherwise, a second determination is made as to
whether the difference between the LTAE and the STAE is less than a
predetermined threshold. In the event that the difference between the LTAE
and the STAE is less than the predetermined threshold, the PMLR is
determined and compared to a selected threshold. If the PMLR is greater than
the selected threshold, the audio frame represents a voice signal. Otherwise,
it represents silence.



003239.P010 



-5- 



WWS/wlr 



BRIEF DESCRIPTION OF THE DRAWINGS 



The features and advantages of the present invention will become
apparent from the following detailed description of the present invention in
which:

Figure 1 is an illustrative diagram of a system comprising a first
networking device operating in accordance with the present invention.

Figure 2 is an illustrative diagram of an embodiment of a
communication module employed within the first networking device of
Figure 1.

Figure 3 is an illustrative flowchart of the operations of the first
networking device of Figure 1.

Figure 4 is an illustrative block diagram of the data structure of a
service frame.

Figure 5 is an illustrative block diagram of the data structure of a
silence suppression frame.

Figure 6 is an illustrative flowchart of the operations of the second
networking device.

Figure 7 is an illustrative block diagram of the operations of the
comfort noise generator.

Figure 8 is an illustrative flowchart of the operations of the voice
activity detector.

Figure 9 is an illustrative block diagram of hardware for calculating the
average peak-to-mean ratio.

Figure 10 is an illustrative state diagram of a
decision smoothing state machine for further reduction of in/out effects.
DETAILED DESCRIPTION OF AN EMBODIMENT



Herein, embodiments of the present invention relate to a system and
method for enhancing reliability in voice activity detection. This is
accomplished by an improved voice activity detector in which an additional
parameter, a peak-to-mean likelihood ratio (PMLR), is used in combination
with long-term averaged energy and short-term averaged energy parameters
to determine whether various segments of audio constitute voice or silence.
The use of the peak-to-mean likelihood ratio by the voice activity detector
will reduce audio degradation currently experienced by conventional DTX
systems.

Herein, certain terminology is used to describe various features of the
present invention. In general, a "system" comprises one or more networking
devices coupled together through corresponding signal lines. A "networking
device" comprises a digital platform such as, for example, a MARATHON™
frame relay product by Nortel/MICOM, a voice-over Asynchronous Transfer
Mode (ATM) product such as Passport 4740™ by Nortel/MICOM, cellular
telephones operating in accordance with a cellular communication standard
(e.g., GSM) and the like. Such a digital platform usually comprises software
and/or hardware to perform analog to linear conversion, echo cancellation,
speech coding, etc. A "signal line" includes any communications link capable
of transmitting digital information at some ascertainable bandwidth.
Examples of a signal line include a variety of mediums such as T1/E1, frame
relay, private leased line, satellite, microwave, fiber optic, cable, wireless
communications (e.g., radio frequency "RF") or even a logical link.

Additionally, "information" generally comprises a signal having one
or more bits of data, address, control or any combination thereof. A
"communication module" includes a voice activity detector used to
determine whether various segments of audio constitute voice or silence. In
this embodiment, the "voice activity detector" (VAD) is software; however, it
is contemplated that the VAD may be implemented in its entirety as
hardware, or as firmware being a combination of hardware and software.

Referring to Figure 1, an illustrative embodiment of a system utilizing
the present invention is shown. Herein, system 100 includes a first
networking device (source) 110 coupled to a second networking device
(destination) 120 via a signal line 130. Herein, networking device 110 receives
analog audio signals 140 as input and digitizes the audio to produce, for
example, pulse code modulation (PCM) audio. The PCM audio is separated into
multiple frames, where various signal characteristics of each frame are
analyzed by a voice activity detector (VAD) as described below in Figure 8.
From these signal characteristics, first networking device 110 can determine
whether to transmit a compressed audio frame (referred to as a "service
frame") or to transmit a silence suppression frame providing a background
noise level as described below.

Referring now to Figure 2, first networking device 110 comprises a
communication module 200. Communication module 200 includes a
substrate 210 which is formed with any type of material or combination of
materials upon which integrated circuit (IC) devices can be attached.
Communication module 200 is adapted to a connector 220 in order to
exchange information with other logic mounted on a circuit board 260 of
networking device 110 for example. Any style for connector 220 may be used,
including a standard female edge connector, a pin field connector, a socket, a
network interface card (NIC) connection and the like.

As shown, communication module 200 includes memory 230 and a
processing unit 240. In this embodiment, memory 230 includes off-chip
volatile memory to contain software which, when executed by processing
unit 240, performs voice activity detection. Of course, non-volatile memory
may be used in combination with or in lieu of volatile memory. Processing
unit 240 includes, but is not limited or restricted to, a general purpose
microprocessor, a digital signal processor, a micro-controller or any other
logic having software processing capabilities. Processing unit 240 includes
on-chip internal memory (M) 250 to receive information from memory 230 for
internal storage, thereby enhancing its processing speed.

Referring now to Figure 3, an illustrative flowchart of the operations
performed by first networking device 110 is shown. Initially, first networking
device 110 receives analog audio and digitizes the audio. For this example,
the audio may be converted into PCM audio (block 300). The PCM audio is
modified by an echo canceler (block 310), in order to eliminate echo returned
from second networking device 120 of Figure 1, and thereafter, each frame of
the PCM audio is analyzed by a voice activity detector (VAD). For example,
the VAD may be software executed by processing unit 240 of Figure 2 (block
320). Based on signal characteristics of each PCM audio frame, a
determination is made whether the frame constitutes voice or silence (block
330).

If the frame is determined to be voice, first networking device 110
enters into a VOICE mode. In this mode, the PCM audio frame is loaded into
a speech coder which compresses the PCM audio frame to produce a service
frame as shown in Figure 4 (block 340). The service frame 260 includes a
header 265 to identify the frame and payload 270 to contain compressed audio.
Such compression is performed in accordance with any existing or later
developed compression function.

Alternatively, if the frame is determined to be silence, first networking
device 110 enters into a SILENCE SUPPRESSION mode. In this mode, a silence
suppression frame (see Figure 5) is transmitted to second networking
device 120 (block 350). The silence suppression frame 275 comprises a header 280,
a first field 285 to contain a background noise level being an energy value
representing the background noise, and a second field 290 to contain the
complement of the background noise level. The complement is included for
error checking. This process, inclusive of voice activity detection, continues
for each PCM audio frame (block 360).
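The error check provided by the complement field can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the header value, the field widths, or the kind of complement, so the one-byte header, the one-byte fields, the bitwise complement and the function names below are all hypothetical.

```python
# Hypothetical layout: 1-byte header, 1-byte noise level, 1-byte complement.
SILENCE_HEADER = 0x01  # assumed header value, not taken from the text


def build_silence_frame(noise_level: int) -> bytes:
    """Pack a background noise level together with its bitwise complement."""
    if not 0 <= noise_level <= 0xFF:
        raise ValueError("noise level must fit in one byte")
    return bytes([SILENCE_HEADER, noise_level, noise_level ^ 0xFF])


def parse_silence_frame(frame: bytes) -> int:
    """Return the noise level, or raise if the complement check fails."""
    if len(frame) != 3 or frame[0] != SILENCE_HEADER:
        raise ValueError("not a silence suppression frame")
    level, complement = frame[1], frame[2]
    if level ^ complement != 0xFF:
        raise ValueError("complement mismatch: frame corrupted")
    return level
```

A receiver built this way rejects any frame in which a single bit of either field has flipped, because the two fields no longer complement each other.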

Referring now to Figure 6, an illustrative flowchart of the operations 
performed by second networking device 120 of Figure 1 is shown. Upon 
receiving a frame of information (block 400), second networking device 120 
determines whether a silence suppression frame has been received (block 
410). If so, the background noise level recovered from the silence suppression 
frame is loaded into a comfort noise generator (CNG). The CNG produces 
comfort noise samples based on the received background noise level in order to
avoid audio artifacts such as in/out effects (block 420).

In particular, as shown in Figure 7, CNG 500 includes linear factor
calculator 510 to handle various ranges of background noise levels. Each of
these ranges (in dB) is mapped into a linear factor 520 which is used to scale a
constant level of noise 530 supplied by a random number generator. The
scaled white noise 540 is then passed through a first order 1/f filter 550 to
obtain the pink noise samples. The resultant pink noise is a regeneration of
the background noise at the source. Thereafter, the pink noise samples are
placed in an analog format (block 430) as shown in Figure 6.
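The comfort noise path can be sketched as below. This is a rough illustration, not the filter of the embodiment: the direct dB-to-linear formula stands in for the range-based lookup of linear factor calculator 510, a one-pole low-pass filter stands in for the first order 1/f filter, and the names `comfort_noise`, `pole` and `seed` are assumptions.

```python
import random


def comfort_noise(level_db, n, pole=0.9, seed=None):
    """Generate n low-pass-shaped noise samples scaled to level_db."""
    rng = random.Random(seed)
    gain = 10.0 ** (level_db / 20.0)  # dB level to linear amplitude factor
    samples, y = [], 0.0
    for _ in range(n):
        white = gain * rng.uniform(-1.0, 1.0)  # scaled white noise
        y = pole * y + (1.0 - pole) * white    # one-pole low-pass (assumed stand-in)
        samples.append(y)
    return samples
```

For example, `comfort_noise(-40.0, 80, seed=1)` yields one 80-sample frame whose amplitude never exceeds the linear factor 0.01.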

Referring still to Figure 6, in the alternative event that a service frame
is detected so no error condition is triggered (blocks 440-450), the service frame
is transferred to a speech decoder to recover a substantial portion of the
original PCM audio (block 460). Thereafter, the PCM audio is placed in an
analog format (block 430).

Referring to Figure 8, an illustrative flowchart of the operations of the
voice activity detector (VAD) is shown. Initially, each audio frame is collected
for N samples per frame (block 600). In this embodiment, the sampling
number "N" is approximately 80 samples per frame, but may be any number
of samples up to the size supported by a speech coder. After the audio frame
has been collected, a number of signal parameters are calculated, including
the short-term averaged energy, the long-term averaged energy, and the
peak-to-mean likelihood ratio.

Before calculating the short-term averaged energy and the long-term
averaged energy, the energy associated with the current audio frame is
calculated (block 610). This is accomplished by squaring each voice sample ($s_i$)
for the current audio frame and summing the squared results. The frame
energy is defined by equation (3).

(3) $E = \sum_{i=0}^{N-1} s_i^{2}$
After the current frame energy has been calculated, it is converted into
a decibel (dB) value (block 620). This provides a larger dynamic range to
handle a greater energy variance for each sampled audio frame. The frame
energy (in dB) is calculated as shown in equation (4).

(4) $E_{dB} = 10\log_{10}(E)$
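Equations (3) and (4) can be illustrated with a short sketch. The function name and the `floor` parameter are assumptions; the floor is added here only so that an all-zero frame does not produce a logarithm of zero, a case the text does not address.

```python
import math


def frame_energy_db(samples, floor=1e-10):
    """Frame energy in dB: sum of squared samples, then 10*log10."""
    energy = sum(s * s for s in samples)           # equation (3)
    return 10.0 * math.log10(max(energy, floor))   # equation (4), floored (assumption)
```

A frame of ten samples of amplitude 1.0 has energy 10 and therefore a level of 10 dB on this scale.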

After calculating $E_{dB}$ for the current frame, the short-term averaged
energy may be calculated (block 630). The short-term averaged energy (STAE)
is an accumulation of signal energy associated with successive PCM audio
frames. The current frame energy $E_{dB}$ and the STAE for the previous frame
are weighted by predetermined factors "$\alpha$" and "$1-\alpha$" so that the resultant
value is the STAE for the current frame. The selection of the factor "$\alpha$" may
be set through simulations. Herein, the STAE is defined in equation (5) as:

(5) $E_{s}(k) = \alpha \times E_{dB}(k) + (1-\alpha) \times E_{s}(k-1)$, where

$\alpha = \begin{cases} 0.125, & \text{if } E_{dB}(k) > E_{s}(k-1) \\ 0.25, & \text{otherwise} \end{cases}$

"$\alpha$" denotes a selected factor of the energy of a current
PCM audio frame to be added to the accumulated average;

"$E_{dB}(k)$" denotes the current frame energy in decibels; and

"$E_{s}(k-1)$" denotes the prior short-term averaged energy
value.
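The STAE recursion of equation (5) reduces to a few lines. The sketch below uses the two weighting values given above; the function name is an assumption.

```python
def update_stae(e_db, prev_stae):
    """One step of the short-term averaged energy (equation (5))."""
    # Rising energy is weighted less (0.125) than falling energy (0.25).
    alpha = 0.125 if e_db > prev_stae else 0.25
    return alpha * e_db + (1.0 - alpha) * prev_stae
```

With a previous STAE of -40 dB, a louder frame at -20 dB moves the average to -37.5 dB, while a quieter frame at -60 dB moves it to -45 dB, so the average follows decreases faster than increases.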

Along with the STAE, the "long-term averaged energy" (LTAE) is
calculated (block 640). The LTAE is defined as an additional level of
accumulation to track the background noise level and, for this embodiment, is
updated in accordance with equation (6):

(6) $E_{l}(k) = \begin{cases} \min\{\beta E_{l}(k-1) + (1-\beta) E_{s}(k),\ E_{max}\}, & \text{if } E_{l}(k-1) > E_{s}(k) \\ \min\{E_{l}(k-1) + \delta E,\ E_{max}\}, & \text{otherwise} \end{cases}$

where $\beta = 0.875$;

$\delta E = \begin{cases} 1, & \text{if the previous frame is voice} \\ 1/16, & \text{otherwise} \end{cases}$; and

$E_{max}$ denotes the maximum background level, being set to
-30 dBm0.

In the case where $E_{l}(k-1) < E_{s}(k)$, instead of adaptively updating the LTAE,
we apply a jump ($\delta E$). By doing so, we can update the LTAE promptly when
there is a sudden change in background noise level.
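A sketch of the LTAE update follows. It assumes the smoothing constant 0.875, a jump of 1 dB after a voice frame and 1/16 dB otherwise, and the -30 dBm0 ceiling; the jump values are read from a damaged equation and should be treated as assumptions, as should the function and constant names.

```python
E_MAX_DB = -30.0  # maximum background level (dBm0)
BETA = 0.875      # smoothing constant of equation (6)


def update_ltae(prev_ltae, stae, prev_frame_was_voice):
    """One step of the long-term averaged energy (equation (6))."""
    if prev_ltae > stae:
        # Noise estimate is above the short-term energy: decay toward it.
        candidate = BETA * prev_ltae + (1.0 - BETA) * stae
    else:
        # Otherwise raise the estimate by a fixed jump (assumed values).
        jump = 1.0 if prev_frame_was_voice else 1.0 / 16.0
        candidate = prev_ltae + jump
    return min(candidate, E_MAX_DB)
```

The final `min` keeps the estimate from ever exceeding the maximum background level, whichever branch is taken.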

Next, a peak-to-mean ratio (PMR) is calculated in order to determine
the peak-to-mean likelihood ratio (block 650). The PMR comprises a ratio
between the absolute value of a maximum sampled signal and the
summation of the values for all (N) sampled signals for the current frame as
shown in equation (7). Therefore, as the value of the PMR increases, there is
a greater likelihood that the current frame represents silence because a
waveform associated with silence has lesser energy than a waveform
associated with voice.

(7) $PMR = \dfrac{\max_{0 \le i \le N-1} |s_i|}{\sum_{i=0}^{N-1} |s_i|}$
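The PMR of a frame can be computed as sketched below, taking the denominator as the sum of absolute sample values. The function name and the value returned for an all-zero frame are assumptions, since the text does not cover that case.

```python
def peak_to_mean_ratio(samples):
    """Largest absolute sample divided by the sum of absolute samples."""
    total = sum(abs(s) for s in samples)
    if total == 0:
        return 0.0  # assumed convention for an all-zero frame
    return max(abs(s) for s in samples) / total
```

A frame whose samples all have equal magnitude gives the smallest possible value, 1/N, while a frame with a single non-zero sample gives 1.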

After the PMR is calculated, an average peak-to-mean ratio (APMR) is
determined (block 660) for use in calculating the peak-to-mean likelihood
ratio (PMLR). The reason for calculating the APMR is to prevent frequent
alternations between VOICE mode and SILENCE SUPPRESSION mode based
on environmental conditions (e.g., speaker talks loudly, noisy environment,
etc.). Consequently, the occurrence of an in/out effect is substantially
mitigated.

As shown in Figure 9, one technique to calculate the APMR is to
implement a circular buffer 700 having depth "M". During analysis of a frame
by the VAD, the PMR for that frame is inserted into buffer 700. After each
insertion, the APMR is calculated by averaging all of the PMRs loaded into
buffer 700 based on equation (8):

(8) $APMR = \dfrac{1}{M}\sum_{i=0}^{M-1} PMR_i$
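The circular buffer of Figure 9 can be modeled in software with a bounded deque, as in the sketch below. The class name is an assumption, and so is the behavior before the buffer has filled: the sketch averages over the entries present rather than dividing by M.

```python
from collections import deque


class ApmrTracker:
    """Running average of the last `depth` PMR values (equation (8))."""

    def __init__(self, depth):
        # A deque with maxlen drops the oldest entry once it is full.
        self._buf = deque(maxlen=depth)

    def push(self, pmr):
        """Insert the PMR of the current frame and return the APMR."""
        self._buf.append(pmr)
        return sum(self._buf) / len(self._buf)
```

With a depth of 3, pushing 0.2, 0.4, 0.6 and then 0.8 gives averages over one, two, three and then the latest three values, since the first entry has been overwritten by the fourth.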

Referring back to Figure 8, it is contemplated that the PMR and APMR
may be used for voice activity detection. The behavior of PMR or APMR may
vary, depending on the audible level of the speaker's voice or the background
noise. Thus, in this embodiment, a normalized parameter, namely a
peak-to-mean likelihood ratio, is calculated and subsequently used to determine
whether a sampled frame represents voice or silence (block 670).

More specifically, the peak-to-mean likelihood ratio (PMLR) is a
parameter which is compared with a predetermined threshold value to
determine whether a sampled frame represents voice or silence. This
threshold value is programmed during simulation, allowing a customer to
select an acceptable tradeoff between voice quality and bandwidth savings.

As shown in equation (9) below, the PMLR is normalized to
substantially mitigate variation caused by different speakers and different
background noise levels. As a result, PMLR has minimal variation between
audio frames in order to discourage in/out effects due to frequent switching
between VOICE mode and SILENCE SUPPRESSION mode. Also, PMLR is
independent of frame size, and thus, can operate with speech coders
supporting different frame sizes.

To determine the PMLR, the VAD keeps track of the maximum APMR
($APMR_{max}$) and the minimum APMR ($APMR_{min}$) contained in buffer 700 of
Figure 9. The contents of buffer 700 may be periodically cleared after a selected
period of time has expired or after a selected number (S) of calls (S>1). From
these values and the APMR associated with the current audio frame ($APMR_{k}$),
the PMLR can be measured by equation (9).

(9) $PMLR = \dfrac{APMR_{max} - APMR_{k}}{APMR_{max} - APMR_{min}}$
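The normalization of equation (9) can be sketched as follows. The function name is an assumption, and the value returned when the maximum and minimum coincide is an assumed fallback chosen to favor voice; the text does not define that case.

```python
def pmlr(apmr_k, apmr_min, apmr_max):
    """Peak-to-mean likelihood ratio of the current frame (equation (9))."""
    span = apmr_max - apmr_min
    if span <= 0:
        return 1.0  # assumed fallback: no spread observed yet, favor voice
    return (apmr_max - apmr_k) / span
```

The result is 0 when the current APMR equals the tracked maximum and 1 when it equals the tracked minimum, so the value stays on a fixed scale regardless of the absolute APMR levels.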

In block 680, based on the STAE, LTAE and PMLR parameters, the VAD
performs a bifurcated decision process to determine whether a sampled audio
frame is voice or silence. A first determination is whether the sum
of the STAE and a selected factor is greater than the LTAE as shown in
equation (10). The factor is set based on simulation results and is
2 dB in this embodiment. Of course, as the factor is
increased, more bandwidth will be conserved because there is greater
probability for the system to be placed in a VOICE mode.

(10) $STAE + \text{factor (2 dB)} > LTAE$

If the sum is greater than the LTAE, the sampled audio frame
is initially considered to be voice. As a result, the VAD performs a second
determination. This determination involves ascertaining the PMLR when
the LTAE and the STAE differ by less than a predetermined threshold. The
predetermined threshold is 4 dB in this embodiment. In
mathematical terms:

$|LTAE - STAE| < \text{Threshold (4 dB)}$

When this condition is met, the VAD determines whether the PMLR
is less than a selected threshold. The selected threshold is
0.50 in this embodiment. If the PMLR is less than the selected threshold, the
sampled audio frame represents silence. Otherwise, it represents voice.
Consequently, the PMLR provides a secondary determination when the LTAE
is approaching the STAE to avoid needless in/out effects.
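The bifurcated decision can be summarized in code. The sketch below uses the 2 dB factor, the 4 dB threshold and the 0.50 PMLR threshold of this embodiment; the function and constant names are assumptions, and for brevity the PMLR is passed in even on paths where it would not need to be computed.

```python
FACTOR_DB = 2.0            # factor of equation (10)
ENERGY_THRESHOLD_DB = 4.0  # threshold on |LTAE - STAE|
PMLR_THRESHOLD = 0.50      # selected PMLR threshold


def classify_frame(stae, ltae, pmlr_value):
    """Return "voice" or "silence" for one frame."""
    # First determination: STAE + factor must exceed LTAE.
    if not (stae + FACTOR_DB > ltae):
        return "silence"
    # Second determination: consult the PMLR only when the energies are close.
    if abs(ltae - stae) < ENERGY_THRESHOLD_DB:
        return "silence" if pmlr_value < PMLR_THRESHOLD else "voice"
    return "voice"
```

For example, `classify_frame(-40.0, -42.0, 0.3)` falls into the close-energy case and is resolved by the PMLR, whereas `classify_frame(-30.0, -45.0, 0.3)` is classified by energy alone.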

Once the determination has been made that the sampled audio frame
is voice or silence, the VAD performs a decision smoothing process (block
690). The decision smoothing function delays the system from switching
from the VOICE mode to the SILENCE SUPPRESSION mode immediately
after the current frame is detected to be silence. This avoids speech clipping at
the end of an utterance.

Referring now to Figure 10, a state diagram concerning the operations
of a decision smoothing state machine 800 of the VAD is shown. State
machine 800 comprises a VOICE (mode) state 810, a SILENCE SUPPRESSION
state 820 and a HANGOVER state 830. For each sampled audio frame, state
machine 800 determines the operating state of the system. In the
HANGOVER state 830, the system operates as in the VOICE state.

As shown, state machine 800 enters or remains in VOICE state 810 if
the current audio frame is determined to be voice, as represented by arrows
840, 845 and 850. However, when the current audio frame is determined to
be silence, the operating mode of the system depends on the current state of
state machine 800. For example, if state machine 800 is in SILENCE
SUPPRESSION state 820, state machine 800 remains in that state as
represented by arrow 855. However, if state machine 800 is in VOICE state 810
and the current audio frame is determined to be silence, state machine 800 enters
into HANGOVER state 830 as represented by arrow 860. Consequently, only
after a predetermined number (Q) of subsequent audio frames are determined
to be silence (# of frames > Q) does state machine 800 enter into SILENCE
SUPPRESSION state 820, as represented by arrow 865. However, if prior to
that time, the sampled audio frame is determined to be voice, state machine
800 enters into VOICE state 810 as represented by arrow 850. As a result of these
operations, speech clipping is substantially avoided.
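The transitions of state machine 800 can be sketched as a small class. The class and attribute names are assumptions, as are the initial state and the choice to count the frame that triggers the HANGOVER state toward Q; the text gives no value for Q.

```python
from enum import Enum


class State(Enum):
    VOICE = "voice"
    HANGOVER = "hangover"
    SILENCE_SUPPRESSION = "silence_suppression"


class DecisionSmoother:
    """Delays the switch from VOICE to SILENCE SUPPRESSION by a hangover."""

    def __init__(self, q):
        self.q = q                              # hangover length in frames
        self.state = State.SILENCE_SUPPRESSION  # assumed initial state
        self.silent_frames = 0

    def step(self, frame_is_voice):
        if frame_is_voice:
            self.state = State.VOICE            # arrows 840, 845, 850
            self.silent_frames = 0
        elif self.state is State.VOICE:
            self.state = State.HANGOVER         # arrow 860
            self.silent_frames = 1
        elif self.state is State.HANGOVER:
            self.silent_frames += 1
            if self.silent_frames > self.q:
                self.state = State.SILENCE_SUPPRESSION  # arrow 865
        return self.state                       # arrow 855: stay in silence
```

While the machine is in the HANGOVER state the caller would keep encoding frames as in VOICE mode, which is what protects the tail of an utterance from clipping.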

While certain exemplary embodiments have been described and
shown in the accompanying drawings, it is to be understood that such
embodiments are merely illustrative of and not restrictive on the broad
invention, and that this invention is not to be limited to the specific constructions
and arrangements shown and described, since various other modifications
may occur to those ordinarily skilled in the art.
CLAIMS



What is claimed is: 



1. A method for enhancing voice activity detection comprising:

determining a peak-to-mean likelihood ratio; and

comparing the peak-to-mean likelihood ratio to a selected threshold to
determine whether a current audio frame represents a voice signal.

2. The method of claim 1, wherein prior to determining the
peak-to-mean likelihood ratio, the method further comprises:

determining a short-term averaged energy for the current audio frame;
and

determining a long-term averaged energy for the current audio frame.

3. The method of claim 2, wherein after determining the
short-term averaged energy and the long-term averaged energy, the method further
comprises:

determining whether a sum of the short-term averaged energy and a
factor is greater than the long-term averaged energy; and

determining that the current audio frame represents silence if the sum
is less than the long-term averaged energy, without necessitating a
determination of the peak-to-mean likelihood ratio.
4. The method of claim 3, wherein upon determining that the sum is
greater than the long-term averaged energy and before determining the
peak-to-mean likelihood ratio, the method further comprises:

determining whether a difference between the long-term averaged
energy and the short-term averaged energy is less than a predetermined
threshold;

determining that the current audio frame represents voice if the
difference is greater than the predetermined threshold; and

continuing by determining the peak-to-mean likelihood ratio if the
difference is less than the predetermined threshold.

5. The method of claim 2, wherein the determining of the
short-term averaged energy comprises:

determining an energy, in decibels, of the current audio frame;

determining a short-term averaged energy for a prior audio frame; and

conducting a weighted average of the energy of the current audio frame
and the short-term averaged energy for the prior audio frame.

6. The method of claim 1, wherein the determining of a
peak-to-mean likelihood ratio comprises:

calculating an averaged peak-to-mean ratio for the current audio
frame;

determining a maximum averaged peak-to-mean ratio;

determining a minimum averaged peak-to-mean ratio;

determining a first result being a difference between the maximum
averaged peak-to-mean ratio and the averaged peak-to-mean ratio for the
current audio frame;

determining a second result being a difference between the maximum
averaged peak-to-mean ratio and the minimum averaged peak-to-mean ratio;
and

conducting a ratio between the first result and the second result to
produce the peak-to-mean likelihood ratio.

7. A communication module comprising:

a substrate;

a processing unit placed on the substrate; and

a memory coupled to the processing unit, the memory to contain a
voice activity detector which, when executed by the processing unit, analyzes
a short-term averaged energy, a long-term averaged energy, and a
peak-to-mean likelihood ratio in order to determine whether a current audio frame
represents voice or silence.

8. The communication module of claim 7, wherein the voice
activity detector, when executed, controls the processing unit to determine
whether a sum of the short-term averaged energy and a predetermined factor
is greater than the long-term averaged energy, and to signal that the current
audio frame represents silence if the sum is less than the long-term averaged
energy.



9. The communication module of claim 8, wherein the voice
activity detector, when executed, controls the processing unit to determine
whether a difference between the long-term averaged energy and the
short-term averaged energy is less than a predetermined threshold, and to signal
that the current audio frame represents voice if the difference is greater than
the predetermined threshold.

10. The communication module of claim 9, wherein the voice
activity detector, when executed, controls the processing unit to determine
the peak-to-mean likelihood ratio, and to compare the peak-to-mean
likelihood ratio to a selected threshold to determine whether a current audio
frame represents a voice signal.

1 11. The communication module of claim 10, wherein the voice 

2 activity detector, when executed, controls the processing unit to determine a 

3 peak-to-mean ratio by (i) sampling an analog signal a predetermined number 

4 of times to produce a plurality of sampled signals each having a sampled 

5 value, (ii) determining a maximum value of the plurality of sampled signals,

6 and (iii) conducting a ratio between an absolute value of the maximum value 

7 and a summation of the sampled values for the plurality of sampled signals. 
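
As a minimal sketch of steps (ii) and (iii) of claim 11, the peak-to-mean ratio of one frame might be computed as below. The function name `peak_to_mean_ratio` is hypothetical. The sketch assumes that each sampled value is taken as a magnitude so that the summation cannot cancel to zero, and that an all-zero frame yields a ratio of zero; neither assumption is stated in the claim.

```python
from typing import Sequence


def peak_to_mean_ratio(samples: Sequence[float]) -> float:
    """Ratio of the peak magnitude to the summed magnitudes of one frame.

    Hypothetical helper: the claim recites a ratio between the absolute
    value of the maximum sampled value and a summation of the sampled values.
    """
    if not samples:
        raise ValueError("frame must contain at least one sample")
    # Assumption: sampled values are treated as magnitudes.
    magnitudes = [abs(s) for s in samples]
    total = sum(magnitudes)
    if total == 0:
        # Assumption: an all-zero frame is given a ratio of zero.
        return 0.0
    return max(magnitudes) / total
```

For a frame holding the values 1, -3, 2 and 2, the peak magnitude is 3 and the summation is 8, so the ratio is 0.375.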

1 12. The communication module of claim 10, wherein the voice 

2 activity detector, when executed, controls the processing unit to determine an 

3 averaged peak-to-mean ratio for the current audio frame by (i) monitoring a 

4 maximum averaged peak-to-mean ratio and a minimum averaged peak-to- 

5 mean ratio, (ii) determining a first result being a difference between the 

6 maximum averaged peak-to-mean ratio and the averaged peak-to-mean ratio 

7 for the current audio frame, (iii) determining a second result being a 

8 difference between the maximum averaged peak-to-mean ratio and the 
9 minimum averaged peak-to-mean ratio, and (iv) conducting a ratio between

10 the first result and the second result to produce the peak-to-mean likelihood 

11 ratio. 
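
Steps (ii) through (iv) of claim 12 can be sketched as follows. The function name and parameter names are hypothetical, and the handling of the case where the tracked maximum and minimum coincide is an assumption, since the claim does not address it.

```python
def peak_to_mean_likelihood_ratio(
    current_avg_pmr: float,
    max_avg_pmr: float,
    min_avg_pmr: float,
) -> float:
    """Peak-to-mean likelihood ratio from the tracked extremes.

    first result  = maximum averaged ratio - current averaged ratio
    second result = maximum averaged ratio - minimum averaged ratio
    likelihood    = first result / second result
    """
    first_result = max_avg_pmr - current_avg_pmr
    second_result = max_avg_pmr - min_avg_pmr
    if second_result == 0:
        # Assumption: with no spread between the extremes, return zero.
        return 0.0
    return first_result / second_result
```

With a tracked maximum of 0.75, a tracked minimum of 0.25 and a current averaged ratio of 0.5, the first result is 0.25, the second result is 0.5, and the likelihood ratio is 0.5.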

1 13. A machine readable medium having embodied thereon a 

2 computer program for processing by a machine, the computer program 

3 comprising: 

4 a first routine for determining a peak-to-mean likelihood ratio; and 

5 a second routine for comparing the peak-to-mean likelihood ratio to a 

6 selected threshold to determine whether an audio frame being transmitted 

7 represents a voice signal. 

1 14. The machine readable medium of claim 13, wherein the 

2 computer program further comprises:

3 a third routine for determining a short-term averaged energy for the 

4 audio frame, the third routine being executed before the first and second 

5 routines; and 

6 a fourth routine for determining a long-term averaged energy for the 

7 audio frame, the fourth routine being executed before the first and second 

8 routines. 

1 15. The machine readable medium of claim 14, wherein the 

2 computer program further comprises:

3 a fifth routine for determining whether a sum of the short-term 

4 averaged energy and a predetermined factor is greater than the long-term 

5 averaged energy, the fifth routine being executed before the first and second 

6 routines; and 
7 a sixth routine for determining whether a difference between the long-

8 term averaged energy and the short-term averaged energy is less than a

9 predetermined threshold, the sixth routine being executed after determining 

10 that the sum is greater than the long-term averaged energy and before 

11 execution of the first and second routines. 

1 16. The machine readable medium of claim 15, wherein the fifth 

2 routine determines that the current audio frame represents silence if the

3 sum is less than the long-term averaged energy. 

1 17. The machine readable medium of claim 15, wherein the sixth 

2 routine determines that the current audio frame represents voice if the

3 difference is greater than the predetermined threshold. 

1 18. A voice activity detector comprising: 

2 circuitry to determine a short-term averaged energy for an audio frame; 

3 circuitry to determine a long-term averaged energy for the audio frame; 

4 circuitry to determine whether the short-term averaged energy is 

5 greater than the long-term averaged energy by a predetermined factor; 

6 circuitry to determine whether a difference between the long-term 

7 averaged energy and the short-term averaged energy is less than a 

8 predetermined threshold when the short-term averaged energy is greater 

9 than the long-term averaged energy by the predetermined factor; 

10 circuitry to determine a peak-to-mean likelihood ratio when the 

11 difference between the long-term averaged energy and the short-term 

12 averaged energy is less than the predetermined threshold; and 
13 circuitry to compare the peak-to-mean likelihood ratio to a selected

14 threshold and to determine that the audio frame represents a voice signal

15 when the peak-to-mean likelihood ratio is greater than the selected threshold.
ABSTRACT OF THE DISCLOSURE



A voice activity detector to analyze a short-term averaged energy
(STAE), a long-term averaged energy (LTAE), and a peak-to-mean likelihood
ratio (PMLR) in order to determine whether a current audio frame being
transmitted represents voice or silence. This is accomplished by determining
whether a sum of the STAE and a factor is greater than the LTAE. If not, the
current audio frame represents silence. If so, a second set of determinations is
performed. Herein, a determination is made as to whether the difference
between the LTAE and the STAE is less than a predetermined threshold. If
so, the current audio frame represents voice. Otherwise, the PMLR is
determined and compared to a selected threshold. If the PMLR is greater than
the selected threshold, the current audio frame represents a voice signal.
Otherwise, it represents silence.
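
The three-stage decision described in this abstract can be sketched as follows. This is an illustration only: the function name, the parameter names and the string labels are hypothetical, and the comparisons follow the wording of the abstract. The PMLR is passed in as an already computed value for brevity, although the abstract determines it only when the third stage is reached.

```python
def classify_frame(
    stae: float,
    ltae: float,
    pmlr: float,
    factor: float,
    diff_threshold: float,
    pmlr_threshold: float,
) -> str:
    """Label a frame as "voice" or "silence" per the abstract's ordering."""
    # Stage 1: the sum of the STAE and the factor must exceed the LTAE.
    if not (stae + factor > ltae):
        return "silence"
    # Stage 2: a small difference between the LTAE and the STAE means voice.
    if ltae - stae < diff_threshold:
        return "voice"
    # Stage 3: otherwise the PMLR decides.
    return "voice" if pmlr > pmlr_threshold else "silence"
```

For instance, with a factor of 3 and a difference threshold of 1, a frame whose STAE is 10 and whose LTAE is 20 is labeled silence at the first stage, while a frame whose STAE is 10 and whose LTAE is 12 passes the first two checks without a decision and is labeled according to its PMLR.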